After careful optimization of the memory arrangement, a four-fold speedup of the retrace step is possible using an NVIDIA GTX 650 GPU. The use of hardware-implemented transcendental functions on the GPU reduced the solution precision without significantly improving performance.

A ring method for communicating the logs between processors was developed and proved to scale better on large numbers of processors than the intrinsic allgather function. An optimal batch size was found, and excessive batch sizes were shown to degrade efficiency. Increasing the number of perturbed models increased efficiency because the models were better load-balanced among processors.
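
The ring exchange can be illustrated with a small, single-process simulation. This is only a sketch of a generic ring allgather, not the implementation used in this work: the function name `ring_allgather` and the list-based "ranks" are hypothetical, and no MPI library is involved. It assumes each rank starts with one block of data and, at every step, forwards the block it most recently received to its right-hand neighbour.

```python
def ring_allgather(local_blocks):
    """Simulate a ring allgather among len(local_blocks) ranks.

    local_blocks[r] is the data initially owned by rank r (hypothetical
    stand-in for that rank's logs). Returns gathered, where gathered[r][i]
    is the block from rank i as seen by rank r after the exchange.
    """
    p = len(local_blocks)
    gathered = [[None] * p for _ in range(p)]
    for r in range(p):
        gathered[r][r] = local_blocks[r]  # every rank starts with its own block

    # p - 1 steps: in step s, rank r forwards the block that originated
    # at rank (r - s) % p to its right-hand neighbour (r + 1) % p.
    for s in range(p - 1):
        # Collect all messages first so the step behaves like a
        # simultaneous exchange rather than a sequential relay.
        messages = [
            ((r + 1) % p, (r - s) % p, gathered[r][(r - s) % p])
            for r in range(p)
        ]
        for dest, origin, block in messages:
            gathered[dest][origin] = block
    return gathered


if __name__ == "__main__":
    blocks = [[10 * r, 10 * r + 1] for r in range(4)]
    for rank, view in enumerate(ring_allgather(blocks)):
        print(rank, view)
```

In this scheme each rank communicates with a single neighbour and takes part in p - 1 exchanges of one block each, which is the property usually cited for ring methods on large processor counts. In an MPI code, each step would typically be a paired send and receive between neighbours.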

